This is the first assignment of the University of Helsinki MOOC.
My Git repository is at the following address: https://github.com/toparvia/IODS-project.git
This is the first entry in the learning diary. The first lecture introduced basic methods of collaboration, such as using Git for your code and writing reports with Markdown and knitr. I have used these before, but I'm more familiar with them on UNIX than on Windows machines. I still felt quite clumsy with these tools, since in my own PhD project I don't need to collaborate on the methods side as much. I chose GitHub Desktop and RStudio, as recommended in the course materials.
I also went through the R Short and Sweet course. The only trouble was with the last assignment: my function approached the problem in a different order and the test didn't accept it at first. Eventually I arrived at a solution that satisfied the test. I'm curious to see how the course will develop.
Below are some tests of R Markdown features I tried:
inline equation: \(A = \pi*r^{2}\)
| Table Header | Second Header |
|---|---|
| Table Cell | Cell 2 |
| Cell 3 | Cell 4 |
library(ggplot2)
library(GGally) # ggpairs() and wrap() come from GGally
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# Data exploration of learning2014 with ggpairs
I found this code quite useful for getting an overview of the data. I have previously used simpler plotting tools for initial analysis, such as separate correlation plots or histograms, but I might use this one again since it provides the same insight in a single figure.
I looked into the data provided for the exercise. The data has a large number of parameters: point estimates and some categorical variables. I used the common commands for simple analysis.
require(readr)
PKY <- read_delim("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
#working directory to source file location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #works in Rstudio only
dim(PKY) #dimensions of the data
str(PKY) #structure of the data
summary(PKY) #summary of the data
I also used some other methods for data exploration, but I'll write about those at the end of this exercise, together with the data analysis.
library(data.table)
#Here we make each of the needed variables for practice
gender <- data.table(PKY$gender)
age <- data.table(PKY$Age)
attitude <- data.table(PKY)[, .(Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj)]
attitude <- attitude/10
deep <- data.table(PKY)[, .(D03+D11+D19+D27+D07+D14+D22+D30+D06+D15+D23+D31)]
deep <- deep/12
surf <- data.table(PKY)[, .(SU02+SU10+SU18+SU26+SU05+SU13+SU21+SU29+SU08+SU16+SU24+SU32)]
surf <- surf/12
stra <- data.table(PKY)[, .(ST01+ST09+ST17+ST25+ST04+ST12+ST20+ST28)]
stra <- stra/8
points <- data.table(PKY$Points)
#Combining variables to one data.table
learn2014 <- data.table(cbind(gender, age, attitude, deep, stra, surf, points))
#giving column names for the data.table
colnames(learn2014) <- c("gender", "age", "attitude", "deep", "stra", "surf", "points")
#removing the observations with zero points
learn2014 <- learn2014[points >0]
#writing the file to the data folder
write.csv(learn2014, file = "data/learn2014.csv")
#reading the table
learn2014_reloaded <- read.csv(file = "data/learn2014.csv")
#showing the data summary with reloaded data
summary(learn2014_reloaded)
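As an aside, the column-averaging steps above could be condensed with `rowMeans()`. A minimal sketch on a toy data frame, with hypothetical item columns `q1`–`q4` standing in for the questionnaire items:

```r
# Condensed version of the averaging idea: rowMeans() over a column
# subset replaces the manual Da+Db+... sums divided by the item count.
toy <- data.frame(q1 = c(1, 2), q2 = c(3, 4), q3 = c(5, 6), q4 = c(7, 8))
toy$deep <- rowMeans(toy[, c("q1", "q2")])  # mean of the "deep" items
toy$surf <- rowMeans(toy[, c("q3", "q4")])  # mean of the "surface" items
toy$deep  # 2 3
```

The same call works directly on the real item columns of `PKY`, avoiding the intermediate sums and divisions.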
I produced the required data with the data.table package, which I had used only briefly before, so this seemed like a good place to practice. I could probably have written this as a single command, but I felt more confident doing it one phase at a time. Next we start the analysis of the data.

# Data analysis
learn2014 <- read.csv(file = "data/learn2014.csv")
require(GGally)
require(ggplot2)
#This helper adds, for each parameter pair, a linear regression line and a loess (locally weighted scatterplot smoothing) curve, so non-linear relationships between parameters and their correlations can be inspected; it plugs into the ggplot2 and GGally packages.
my_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=loess, fill="red", color="red", ...) +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p
}
g = ggpairs(learn2014,columns = c(2:7), lower = list(continuous = my_fn))
g
# Here we can see that the combined parameters don't have high correlations with each other, and that all parameters other than age and surf show a positive linear relationship with points. Sex is excluded from this picture.
#The distributions look roughly normal. Age is not normally distributed; perhaps it follows a Poisson distribution.
p <-ggpairs(learn2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
#Sex seems to affect attitude the most; however, the differences look rather small.
#I chose the model parameters based on the data exploration I did before.
#Attitude looked promising so it was in all the models.
m1 <- lm(points ~ attitude*deep*stra*surf+age, data=learn2014)
m2 <- lm(points ~ attitude + stra + surf + age + attitude:stra +
attitude:surf + stra:surf + attitude:stra:surf, data = learn2014)
m3 <- lm(points ~ attitude+deep+stra+surf+age, data=learn2014)
m4 <- lm(points ~ attitude + stra + age, data = learn2014)
m5 <- lm(points ~ attitude + stra, data = learn2014)
m6 <- lm(points ~ attitude + stra + gender, data = learn2014)
AIC(m1,m2,m3,m4,m5,m6)
## df AIC
## m1 18 1039.982
## m2 10 1031.755
## m3 7 1029.450
## m4 5 1028.225
## m5 4 1029.038
## m6 5 1030.978
#By Akaike's information criterion m4 has the lowest AIC, but m5 is within one AIC unit of it and more parsimonious, so m5 is chosen.
par(mfrow=c(2,2))
plot(m5)
#The model looks good. Residuals are distributed quite evenly on both sides of zero and there are only a few outliers. The included parameters probably fulfil the normality assumption and form a good predictor combination, since their correlation with each other was low (<0.06).
drop1(m5)
## Single term deletions
##
## Model:
## points ~ attitude + stra
## Df Sum of Sq RSS AIC
## <none> 4559.4 555.95
## attitude 1 1051.89 5611.3 588.41
## stra 1 81.74 4641.1 556.90
#drop1() tests whether dropping further parameters would improve the model.
#It looks like we already have the most parsimonious model with the lowest AIC.
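As a sanity check on what `AIC()` reports: the criterion is just AIC = 2k − 2·log-likelihood, where k counts the estimated parameters (coefficients plus the residual variance for a linear model). A small illustration on toy data:

```r
# Manual AIC for a linear model: 2k - 2*logLik, with k including sigma^2.
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)
k <- length(coef(fit)) + 1  # intercept, slope, and the residual variance
manual <- 2 * k - 2 * as.numeric(logLik(fit))
all.equal(manual, AIC(fit))  # TRUE
```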
summary(m5)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
# Attitude (global attitude toward statistics) is highly significant and stra (organised studying and time management) is a borderline case with p ≈ 0.09. The model explains only about 20 % of the variation in the data. Perhaps further data wrangling or better models would be in order.
I did some analysis to find the predictors in the full data, just out of interest. I found a different set of predictors that correlated with the points.
library(caret) #findCorrelation algorithm
library(ggcorrplot) #plot the correlations
#PKY is the full dataset
PKY2 <- PKY[-60] # removing the gender column (non-numeric)
correlations <- abs(cor(PKY2, use="pairwise.complete.obs"))
#absolute values of all pairwise correlations
correlations.selected <- findCorrelation(correlations, cutoff = .7, exact = TRUE)
#columns to select, by cutoff 0.7
corr <- cor(PKY2[c(correlations.selected,59)])
#correlations of the narrowed selection plus the points variable (column 59)
# Plotting correlations
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of PKY",
ggtheme=theme_bw)
What does this do? It takes all the predictors, finds all pairwise correlations above a cutoff, so you can see which questions correlate with each other, and then compares that selection with the variable of interest, in this case the points.
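The cutoff idea can be illustrated with base R alone. A toy sketch, where variables `a` and `b` are built to be nearly collinear and `c` is independent noise:

```r
# Flag variable pairs whose absolute correlation exceeds a threshold,
# which is the core of what caret::findCorrelation() automates.
set.seed(2)
x <- rnorm(100)
d <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))
cors <- abs(cor(d))
high <- which(cors > 0.7 & upper.tri(cors), arr.ind = TRUE)
rownames(cors)[high[, "row"]]  # only the a-b pair crosses the cutoff
```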
One other thing I found handy: since I've worked on this data set with two different computers, I can now easily migrate between them and just use Git to move the data. I also found the following function handy, so that I can use the same script on both computers.
#working directory to source file location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #works in Rstudio only
Basically it sets the working directory to the location of the source file without changing the RStudio settings.
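A related detail worth noting: when the report is knitted rather than run interactively, knitr already evaluates chunks relative to the .Rmd file's folder, so the call can be guarded to run only when the RStudio API is actually available. A sketch:

```r
# Only touch the working directory when RStudio's API is present;
# knitting already uses the .Rmd file's directory by default.
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
}
```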
This time I had much less time for this exercise, so I mostly used the methods given.
Data Set Information:
These data approach student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008] the two datasets were modelled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1, because G3 is the final-year grade (issued in the 3rd period) while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the source paper for details).
# To empty the memory after the previous exercise
rm(list=ls())
# Load data
library(dplyr) # glimpse() and select_if() below come from dplyr
#setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
alc <- read.csv(file = "data/ex3alc.csv")
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
#summary(alc)
# The data set is combined from two files from the UCI Machine Learning Repository
#alc is the full dataset; we take the numeric variables for the correlation analysis
alc2 <- select_if(alc, is.numeric)
correlations <- abs(cor(alc2, use="pairwise.complete.obs"))
#absolute values of all pairwise correlations
correlations.selected <- findCorrelation(correlations, cutoff = .3, exact = TRUE)
#columns to select, by cutoff 0.3
#Making sure that grades are part of the correlation table
grade.columns <- 15:17
correlations.selected <- union(correlations.selected,grade.columns )
# calculating the variables
corr <- cor(alc2[c(correlations.selected)])
#correlations of the narrowed selection plus the grade variables G1, G2, G3
# Plotting correlations
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of alc",
ggtheme=theme_bw)
# Now we see that the highest correlations with G1:G3 are with age, Medu, alc_use and Walc. Since alc_use and Walc are highly correlated with each other, only the better one is selected for the analysis. G3 is the final grade, so we first try to predict only that and drop G1 and G2, even though they would likely be good predictors of G3.
# So we want to predict the final grade of the students from their alcohol use, age and their mother's education level.
# I hypothesise that alcohol use reduces performance and that the mother's education level increases it.
my_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=loess, fill="red", color="red", ...) +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p
}
g = ggpairs(alc2,columns = c(correlations.selected), lower = list(continuous = my_fn))
g
# Let's add some factor variables to the data
names.selected <- colnames(alc2[correlations.selected])
alc3 <- subset(alc, select=c(names.selected, "sex", "school", "guardian", "activities"))
p <-ggpairs(alc3, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
# The categorical variables are quite similar and don't reveal much, other than that high alcohol use is more common among males and near-zero alcohol use is more common among females.
m <- glm(G3 ~ alc_use + Walc + age + Medu, family = "gaussian", data = alc3)
m2 <- glm(G3 ~ alc_use + age + Medu, family = "gaussian", data = alc3)
deviance(m2)/m2$null.deviance
## [1] 0.9052058
AIC(m,m2)
## df AIC
## m 6 1969.777
## m2 5 1969.543
par(mfrow=c(2,2))
plot(m2)
# The Akaike information criterion gives a similar result, so, as hypothesised, Walc is dropped since it covaries with alc_use; that way we reach the most parsimonious model.
# The model explains only about 9.5 % of the deviance (1 - residual/null deviance = 0.095) without the help of G1 and G2.
# !!! Oh, I only realised at this point that the target variable was supposed to be high/low alcohol consumption. Now I see why logistic regression is used in this case... !!!
# I suppose I need to do another parameter selection for that case.
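For reference, the deviance ratio computed above is the complement of a deviance-based pseudo-R²: 1 − residual/null deviance. For a gaussian GLM this coincides with the ordinary R², as a quick check on the built-in mtcars data shows:

```r
# Deviance-based pseudo-R^2 for a GLM; for family = "gaussian" the
# deviances are just RSS and TSS, so this equals the usual R^2.
fit <- glm(mpg ~ wt, family = "gaussian", data = mtcars)
pseudo_r2 <- 1 - deviance(fit) / fit$null.deviance
round(pseudo_r2, 3)  # same as summary(lm(mpg ~ wt, data = mtcars))$r.squared
```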
# Feature selection based on LASSO (least absolute shrinkage and selection operator), using centering and scaling as preprocessing, followed by 10-fold cross-validation
fit <- sbf(
form = alc$alc_use ~ .,
data = alc[c(2:27,29:31)], method = "glmnet", # Dalc and Walc are dropped as they are the parameters high_use is based on; G1:G3 are dropped as well, since grades are known only after the students finish their studies (especially the final exam G3)
tuneGrid=expand.grid(.alpha = .01, .lambda = .1),
preProc = c("center", "scale"),
trControl = trainControl(method = "none"),
sbfControl = sbfControl(functions = caretSBF, method = 'cv', number = 10)
)
fit
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance:
##
## RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 0.355 0.8812 0.2448 0.06579 0.02871 0.04269
##
## Using the training set, 15 variables were selected:
## sexM, age, addressU, Fjobservices, reasonother...
##
## During resampling, the top 5 selected variables (out of a possible 20):
## absences (100%), age (100%), failures (100%), freetime (100%), goout (100%)
##
## On average, 14.6 variables were selected (min = 13, max = 17)
# Dalc is dropped as it is one of the parameters high_use is based on
bm <- glm(high_use ~ failures + absences + freetime + age, data = alc, family = "binomial")
bm2 <- glm(high_use ~ failures + absences + freetime, data = alc, family = "binomial")
bm3 <- glm(high_use ~ absences + age + G1+ G2+ G3, data = alc, family = "binomial") # here grades are included
bm4 <- glm(high_use ~ absences + age + G1, data = alc, family = "binomial") # here grades are included
bm5 <- glm(high_use ~ failures + absences + freetime + age + goout, data = alc, family = "binomial")
bm6 <- glm(high_use ~ failures + absences + goout, data = alc, family = "binomial")
# Stepwise forward and backward selection
step(bm6, direction = "both")
## Start: AIC=406.4
## high_use ~ failures + absences + goout
##
## Df Deviance AIC
## <none> 398.40 406.40
## - failures 1 402.50 408.50
## - absences 1 410.92 416.92
## - goout 1 440.14 446.14
##
## Call: glm(formula = high_use ~ failures + absences + goout, family = "binomial",
## data = alc)
##
## Coefficients:
## (Intercept) failures absences goout
## -3.65199 0.41507 0.07547 0.71311
##
## Degrees of Freedom: 381 Total (i.e. Null); 378 Residual
## Null Deviance: 465.7
## Residual Deviance: 398.4 AIC: 406.4
AIC(bm,bm2,bm3,bm4,bm5,bm6) # better fit found without grades
## df AIC
## bm 5 439.7969
## bm2 4 439.4485
## bm3 6 451.1008
## bm4 4 447.3286
## bm5 6 409.0303
## bm6 4 406.3965
# bm6 is the best fit, based on parameters preprocessed by centering and scaling, feature selection with LASSO under 10-fold cross-validation, and removal of confounding parameters.
# The model was further validated with stepwise forward and backward selection and confirmed by checking the Akaike information criterion.
par(mfrow=c(2,2))
plot(bm6)
# Model validation is a bit trickier with binomial models: you won't see such neat plots unless the number of observations is very high (N > 400). In this case the plots look OK.
# These are loaded only now, since they mask some functions from dplyr
library(MASS)
library(fitdistrplus)
alc_num <- as.numeric(alc$high_use)
fitBinom=fitdist(data=alc_num, dist="binom", fix.arg=list(size=2), start=list(prob=0.3))
plot(pnbinom(sort(alc_num), size=2, mu=fitBinom$estimate), ppoints(alc_num))
abline(0,1)
# This shows the estimated binomial distribution in our response variable.
# compute odds ratios (OR)
OR <- coef(bm6) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(bm6))
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02593937 0.0107041 0.05891273
## failures 1.51447465 1.0137625 2.27486120
## absences 1.07838996 1.0341084 1.12889344
## goout 2.04033617 1.6296451 2.58472531
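The odds ratios above are just exponentiated coefficients: each unit increase in a predictor multiplies the odds by exp(coefficient). A toy check on the built-in mtcars data (am ~ wt, where heavier cars should have lower odds of a manual transmission):

```r
# exp() of a logistic coefficient gives the multiplicative change in
# odds per unit of the predictor; OR < 1 means the odds decrease.
fit <- glm(am ~ wt, family = "binomial", data = mtcars)
or <- unname(exp(coef(fit)["wt"]))
or < 1  # TRUE: heavier cars are less likely to be manual
```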
# predictions
pred.bm6 <- predict(bm6, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = pred.bm6)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = pred.bm6 > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
alc[36:38] %>% tail(10)
## high_use probability prediction
## 373 FALSE 0.14055409 FALSE
## 374 TRUE 0.36139910 FALSE
## 375 FALSE 0.19198237 FALSE
## 376 FALSE 0.25734107 FALSE
## 377 FALSE 0.15979460 FALSE
## 378 FALSE 0.34330586 FALSE
## 379 FALSE 0.22362093 FALSE
## 380 FALSE 0.06224146 FALSE
## 381 TRUE 0.55365666 TRUE
## 382 TRUE 0.05797935 FALSE
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = c(high_use), y = probability, color = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64397906 0.05759162 0.70157068
## TRUE 0.18062827 0.11780105 0.29842932
## Sum 0.82460733 0.17539267 1.00000000
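For a table like this, accuracy is the diagonal mass: the sum of the diagonal cells over the total. A minimal example:

```r
# Accuracy from a confusion matrix: correct predictions sit on the
# diagonal (FALSE/FALSE and TRUE/TRUE).
tab <- table(truth = c(TRUE, TRUE, FALSE, FALSE),
             pred  = c(TRUE, FALSE, FALSE, FALSE))
sum(diag(tab)) / sum(tab)  # 0.75: three of four predictions are correct
```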
# how many are correct
sum(alc[36]==alc[38])/382
## [1] 0.7617801
# 76.178 % are correct
1-sum(alc[36]==alc[38])/382
## [1] 0.2382199
# Training error is 23.821 % (DataCamp is 26 %)
# Here is the DataCamp version of the training-error computation (which I find more complicated, but perhaps easier to follow).
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199
# The model is better than the one introduced in DataCamp
# K-fold cross-validation
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:survival':
##
## aml
## The following object is masked from 'package:lattice':
##
## melanoma
cv <- cv.glm(data = alc, cost = loss_func, glmfit = bm6, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2434555
Training error is 23.8 % (not cross-validated; DataCamp's model gives over 26 %). The 10-fold cross-validated error is about 24 % (cv$delta[1] above; it varies slightly between runs).
# I found this very useful in the DataCamp exercises
# produce summary statistics by group with the pipe operator %>%
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade=mean(G3))
This week's exercise was interesting. In the end I tried some methods I hadn't used before, such as LASSO feature selection. I wonder how the parameters were selected in the DataCamp exercises. Was the method in any way similar to the LASSO approach I used? I also tested LASSO without the preprocessing, and the results differed: that model performed much worse than the one with preprocessing. I'm curious how preprocessing can affect the LASSO results so much.
I completed all the DataCamp exercises before proceeding to this one. I also looked up some details on how to use R Markdown more productively, so hopefully this report will be clearer than the previous ones.
Today we are looking at the Boston data set and its per-town crime rate.
| Variables | Description |
|---|---|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |
# To empty the memory after the previous exercise
# rm(list=ls())
# Load data
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # to set working directory to source file
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
All variables are continuous except chas (a dummy variable) and rad (an index), which are integers. The data contains demographic information about Boston, including taxes and other features possibly linked to crime rates within the city.
my_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=loess, fill="red", color="red", ...) +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p
}
g = ggpairs(Boston,columns = c(1:14), lower = list(continuous = my_fn))
g
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
library(corrplot) # corrplot() has not been loaded earlier in this report
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
A more focused figure of the correlations, since they are hardly visible with this many variables in the pairs plot.
# Scaling the variables with mean and standard deviation with scale()
Boston_scaled <- as.data.frame(scale(Boston))
# create a quantile vector of crime and print it
bins <- quantile(Boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
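The same cut()-with-quantile-breaks idea, shown on a toy vector where the four resulting classes are easy to verify by hand:

```r
# quantile() gives the 0/25/50/75/100 % break points; cut() then bins
# the values so each class gets (roughly) a quarter of the data.
v <- 1:8
cls <- cut(v, breaks = quantile(v), include.lowest = TRUE,
           labels = c("low", "med_low", "med_high", "high"))
table(cls)  # two observations per class
```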
# create a categorical variable 'crime'
crime <- cut(Boston_scaled$crim, breaks = bins, include.lowest= TRUE, label = c("low","med_low","med_high", "high"))
# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)
# number of rows in the Boston dataset
n <-nrow(Boston)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- Boston_scaled[ind,]
# create test set
test <- Boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <-test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
summary(train)
## zn indus chas
## Min. :-0.4872 Min. :-1.5563017 Min. :-0.27233
## 1st Qu.:-0.4872 1st Qu.:-0.8755787 1st Qu.:-0.27233
## Median :-0.4872 Median :-0.2108898 Median :-0.27233
## Mean : 0.0249 Mean : 0.0003721 Mean :-0.01895
## 3rd Qu.: 0.1023 3rd Qu.: 1.0149946 3rd Qu.:-0.27233
## Max. : 3.8005 Max. : 2.4201701 Max. : 3.66477
## nox rm age
## Min. :-1.46443 Min. :-3.87641 Min. :-2.33313
## 1st Qu.:-0.92076 1st Qu.:-0.56095 1st Qu.:-0.89968
## Median :-0.16996 Median :-0.10622 Median : 0.33128
## Mean :-0.01907 Mean : 0.02372 Mean :-0.01358
## 3rd Qu.: 0.59809 3rd Qu.: 0.49439 3rd Qu.: 0.90857
## Max. : 2.72965 Max. : 3.55153 Max. : 1.11639
## dis rad tax
## Min. :-1.26582 Min. :-0.981871 Min. :-1.312691
## 1st Qu.:-0.80063 1st Qu.:-0.637331 1st Qu.:-0.766817
## Median :-0.24835 Median :-0.522484 Median :-0.464213
## Mean : 0.01902 Mean :-0.005106 Mean : 0.001413
## 3rd Qu.: 0.70867 3rd Qu.: 1.659603 3rd Qu.: 1.529413
## Max. : 3.95660 Max. : 1.659603 Max. : 1.796416
## ptratio black lstat
## Min. :-2.704702 Min. :-3.90333 Min. :-1.52961
## 1st Qu.:-0.499104 1st Qu.: 0.21960 1st Qu.:-0.83084
## Median : 0.251492 Median : 0.38442 Median :-0.25879
## Mean : 0.000874 Mean : 0.04244 Mean :-0.04857
## 3rd Qu.: 0.805778 3rd Qu.: 0.43330 3rd Qu.: 0.54501
## Max. : 1.637208 Max. : 0.44062 Max. : 3.40663
## medv crime
## Min. :-1.90634 low :103
## 1st Qu.:-0.56081 med_low :102
## Median :-0.12317 med_high: 99
## Mean : 0.01522 high :100
## 3rd Qu.: 0.27098
## Max. : 2.98650
Now we have generated the training and test datasets with the new categorical variable "crime", which is based on the quantiles of the original numeric variable.
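The split logic used above, in miniature: sample() draws 80 % of the row indices without replacement, and the remaining 20 % are kept for testing.

```r
# An 80/20 train/test split on row indices.
set.seed(42)  # seed added here only for reproducibility
n <- 100
ind <- sample(n, size = n * 0.8)
train_rows <- ind
test_rows <- setdiff(seq_len(n), ind)
c(length(train_rows), length(test_rows))  # 80 20
```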
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2524752 0.2450495 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 1.02719692 -0.9098830 -0.119431971 -0.8947828 0.46922567
## med_low -0.07381211 -0.3104388 -0.002135914 -0.5868169 -0.07700492
## med_high -0.39888730 0.2427635 0.085589136 0.3946332 0.15171554
## high -0.48724019 1.0149946 -0.036103054 1.0524455 -0.45913253
## age dis rad tax ptratio
## low -0.9057308 0.9068578 -0.6997720 -0.7503419 -0.43777850
## med_low -0.3875544 0.3617998 -0.5517590 -0.4831187 -0.08723865
## med_high 0.4552940 -0.3849496 -0.4006773 -0.2606792 -0.26500224
## high 0.8225945 -0.8451470 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.38512197 -0.78228765 0.5486330
## med_low 0.31663817 -0.18395745 0.0360874
## med_high 0.09186361 -0.04239709 0.1764413
## high -0.63914692 0.83915210 -0.7150953
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.086269609 0.691882974 -0.94332130
## indus 0.001196408 -0.278265069 0.21853965
## chas -0.069622043 0.007691281 0.20901243
## nox 0.267108039 -0.811845455 -1.39472145
## rm -0.078404366 -0.104693071 -0.13980377
## age 0.352794424 -0.300246624 -0.22825470
## dis -0.051388429 -0.272102952 0.04944894
## rad 3.370724609 1.007041764 -0.07786887
## tax 0.107661377 -0.073239043 0.56167289
## ptratio 0.140519722 0.024495054 -0.29336716
## black -0.132994372 0.022289585 0.11467684
## lstat 0.160603834 -0.169803757 0.46935743
## medv 0.157639783 -0.397748004 -0.17573504
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9520 0.0362 0.0118
g <- ggord(lda.fit, train$crime, ellipse_pro = 0.95, vec_ext = 2, alpha_el = 0.3, size = 2) # ggord() comes from the ggord package
g + theme_tufte() + geom_rangeframe() # theme_tufte() and geom_rangeframe() come from ggthemes
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 11 1 0
## med_low 2 17 5 0
## med_high 1 10 15 1
## high 0 0 1 26
In the high-crime quartile we predicted nearly all cases correctly (26 of 27 in the table above), but the other classes did not fare as well. Still, LDA predicts our classes with reasonably good accuracy.
# k-means clustering
BostonS <- scale(Boston) # standardize variables
wss <- (nrow(BostonS)-1)*sum(apply(BostonS,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(BostonS,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
km <-kmeans(BostonS, centers = 4)
# plot the Boston dataset with clusters
pairs(BostonS, col = km$cluster)
We see results similar to the LDA.
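To see k-means at its clearest, here is a toy run on two well-separated point clouds, which it recovers exactly:

```r
# Two 10-point groups centred at 0 and 10; with this separation
# k-means assigns each group to its own cluster.
set.seed(3)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
km <- kmeans(pts, centers = 2)
sort(km$size)  # 10 10
```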
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Now we can see a 3D picture of the clusters!
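The matrix multiplication step deserves a note: projecting observations onto discriminant axes is an (n × p) data matrix times the (p × k) scaling matrix, yielding n points in k-dimensional discriminant space. In miniature:

```r
# 3 observations x 2 variables, projected by a 2 x 3 coefficient
# matrix (standing in for lda.fit$scaling) onto 3 axes.
X <- matrix(1:6, nrow = 3)                  # n = 3, p = 2
W <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 2)  # p = 2, k = 3
dim(X %*% W)  # 3 3
```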